Learning accurate personalized survival models for predicting hospital discharge and mortality of COVID-19 patients

Since it emerged in December of 2019, COVID-19 has placed a huge burden on medical care in countries throughout the world, as it led to a large number of hospitalizations and mortalities. Many medical centers were overloaded, as their intensive care units and auxiliary protection resources proved insufficient, which made the effective allocation of medical resources an urgent matter. This study describes learned survival prediction models that could help medical professionals make effective decisions regarding patient triage and resource allocation. We created multiple data subsets from a publicly available COVID-19 epidemiological dataset to evaluate the effectiveness of various combinations of covariates (age, sex, geographic location, and chronic disease status) in learning survival models (here, “Individual Survival Distributions”; ISDs) for hospital discharge and also for death events. We then supplemented our datasets with demographic and economic information to obtain potentially more accurate survival models. Our extensive experiments compared several ISD models, using various measures. These results show that the “gradient boosting Cox machine” algorithm outperformed the competing techniques, in terms of these performance evaluation metrics, for predicting both an individual’s likelihood of hospital discharge and COVID-19 mortality. Our curated datasets and code base are available at our GitHub repository for reproducing the results reported in this paper and for supporting future research.

where $\beta_j$ is the learned weight for the $j$-th covariate and $\lambda_0(t)$ is the time-dependent baseline hazard function, which is usually left unspecified (as it is the same across all patients). The value obtained from the exponential term of the Cox-PH model is treated as a personalized, time-invariant risk score. The Kalbfleisch-Prentice extension (Cox-KP) fits a non-parametric baseline function for computing $\lambda_0(t)$, which can then be used to generate the individual survival curve across all time points [20]. This paper compares these Cox-KP models with other ISD models.
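As a minimal sketch of how a Cox-type model yields an individual survival curve, the snippet below applies the standard proportional-hazards identity $S(t \mid x) = S_0(t)^{\exp(\beta^\top x)}$. It assumes the baseline survival curve $S_0(t)$ has already been estimated (for example, by the Kalbfleisch-Prentice procedure, which is not implemented here). The function names, weights, and covariate values are hypothetical and chosen only for illustration.

```python
import math


def cox_risk_score(beta, x):
    """Time-invariant risk score exp(beta . x) of a Cox-PH model."""
    return math.exp(sum(b * v for b, v in zip(beta, x)))


def individual_survival_curve(baseline_survival, beta, x):
    """Evaluate S(t | x) = S0(t) ** exp(beta . x) on the baseline's time grid.

    baseline_survival is assumed to be a pre-estimated list of S0(t) values.
    """
    risk = cox_risk_score(beta, x)
    return [s0 ** risk for s0 in baseline_survival]


# Hypothetical values, for illustration only.
beta = [0.03, 0.5]                 # weights for two covariates
baseline = [1.0, 0.9, 0.75, 0.6]   # S0(t) at four time points
patient = [10.0, 1.0]              # covariate vector of one patient
curve = individual_survival_curve(baseline, beta, patient)
```

A patient with a risk score above 1 gets a curve that lies below the baseline at every time point, and a patient with all-zero covariates recovers the baseline exactly.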

A.3 Multi-task Logistic Regression
Yu et al. proposed the multi-task logistic regression (MTLR) model to compute ISDs while overcoming the limitations of the proportional hazards assumption of the Cox-PH model [21]. MTLR divides the study time period into discrete time intervals and essentially trains a logistic regression model for each interval. The output of each logistic regression estimator is a binary variable $y_i \in \{0, 1\}$, indicating whether this individual is alive in that time interval. The complete ISD is then calculated by combining the logistic regression computation of each interval: where each $\theta_i$ represents the logistic regression parameters and $b_i$ is the bias term for interval $i$, $x$ is a vector of an individual's input covariates, and $y = [y_1, y_2, \ldots, y_m] \in \{0, 1\}^m$ is an individual's event indicator for each interval. Note that all the subscripts in this equation represent the index of the time interval.
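The way the per-interval estimators combine into one distribution can be sketched as follows. This sketch assumes, following the description above, that $y_i = 1$ while the individual is alive, so the only valid sequences are $k$ ones followed by zeros ($k = 0, \ldots, m$). Each valid sequence is scored by $\sum_i y_i(\theta_i^\top x + b_i)$ and the scores are normalized over the $m + 1$ valid sequences. This is a generic illustration of that idea rather than the exact equation of the paper, and the function name and inputs are hypothetical.

```python
import math


def mtlr_survival_curve(thetas, biases, x):
    """Return P(alive in interval j) for j = 1..m under an MTLR-style model.

    Assumption: y_i = 1 means "alive in interval i", so a valid label
    sequence is k ones followed by (m - k) zeros.
    """
    m = len(thetas)
    # Per-interval linear scores theta_i . x + b_i.
    logits = [
        sum(t * v for t, v in zip(theta, x)) + b
        for theta, b in zip(thetas, biases)
    ]
    # scores[k] is the score of the sequence with k leading ones.
    scores = [0.0]
    for z in logits:
        scores.append(scores[-1] + z)
    # Normalize over the m + 1 valid sequences (numerically stable softmax).
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # P(alive in interval j) = sum of probs[k] over k >= j.
    survival = []
    tail = 0.0
    for k in range(m, 0, -1):
        tail += probs[k]
        survival.append(tail)
    survival.reverse()
    return survival


# With all-zero parameters every valid sequence is equally likely.
uniform_curve = mtlr_survival_curve([[0.0], [0.0], [0.0]], [0.0, 0.0, 0.0], [1.0])
```

Because the normalization runs over whole sequences rather than over each interval separately, the resulting curve is non-increasing by construction.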

B Evaluation Metrics
This paper uses the following performance metrics:

B.1 Concordance Index (C-index)
The concordance index, also known as C-index, considers all pairs of "comparable" instances and computes the proportion of these pairs whose actual pair-wise survival ordering matches the ordering predicted by the survival model:

$$\text{C-index}(\hat{S}(\cdot \mid \cdot), D) = \frac{\sum_{i:\delta_i=1} \sum_{j:t_i<t_j} \mathbb{1}\left[\mathrm{Med}(\hat{S}(\cdot \mid x_i)) < \mathrm{Med}(\hat{S}(\cdot \mid x_j))\right]}{\sum_{i:\delta_i=1} \sum_{j:t_i<t_j} 1} \tag{3}$$

where the denominator is the number of "comparable" pairs, $\mathrm{Med}(\hat{S}(\cdot \mid x))$ is the median survival time of the survival distribution $\hat{S}(\cdot \mid x)$, $\mathbb{1}[\varphi]$ is the indicator function, which is 1 if the proposition $\varphi$ is true (otherwise 0), and $\delta_i = 1$ means the $i$-th patient is uncensored. A pair of patients is "comparable" if we can determine who died (or was discharged) first, i.e., if both are uncensored, or if one patient is censored after the observed (uncensored) event time of the other; this corresponds to the set of ordered pairs of indices $\{(i, j) : \delta_i = 1,\ t_i < t_j\}$. The C-index score is a real value between 0 and 1, where 1 means all comparable pairs are predicted correctly. C-index only measures the discriminative ability of a survival model and considers only a (possibly small) fraction of all pairs of patients.
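The definition in Equation 3 can be written directly as a double loop. The sketch below is a plain, quadratic-time illustration that assumes each patient's predicted median survival time has already been extracted from the ISD. Tied predictions count as not concordant, exactly as the strict inequality in Equation 3 implies. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, predicted_medians):
    """C-index over ordered pairs (i, j) with delta_i = 1 and t_i < t_j."""
    concordant = 0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # patient i must be uncensored
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted_medians[i] < predicted_medians[j]:
                    concordant += 1
    if comparable == 0:
        raise ValueError("no comparable pairs in the dataset")
    return concordant / comparable


# Toy data: the second patient is censored at time 5.
times = [2, 5, 7, 9]
events = [1, 0, 1, 1]
score = concordance_index(times, events, [3, 6, 11, 10])
```

Here four pairs are comparable and the model orders three of them correctly, so `score` is 0.75. The censored patient contributes only as the later member of a pair.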

B.2 D-Calibration
Standard "1-calibration" measures the deviation between the observed and the predicted probabilities, over all instances, for a single time point (for example, the probability of dying from COVID-19 in 10 days). But this only considers a single time; instead, distribution calibration (D-calibration) is a way to evaluate the calibration of survival prediction models that produce ISDs (which give probabilities for all future times) [16].
For any probability interval $[a, b] \subseteq [0, 1]$, let $V_{\Theta,D}([a, b])$ be the subset of patients in $D$ whose time of death is assigned a probability (by the learned model) in $[a, b]$. For a D-calibrated model, $V_{\Theta,D}([0, 0.5])$ should include half (i.e., $0.5 - 0$) of the patients, and $V_{\Theta,D}([0.5, 0.75])$ should include $0.75 - 0.5 = \frac{1}{4}$ of the patients, etc. We can compare the distributions of predicted and observed proportions of events using a $\chi^2$ goodness-of-fit test [32]. A well-calibrated model will have a large p-value for D-calibration, which indicates that the distribution of observed proportions of events is statistically similar to the distribution of predicted proportions. Haider et al. provide a method to incorporate censored individuals into the D-calibration calculation by appropriately "spreading" each censored individual among the relevant time intervals [16].
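A minimal sketch of the binning step, for uncensored patients only, is given below. It assumes each patient's predicted survival probability at their own event time, $\hat{S}(d_i \mid x_i)$, is already available, splits $[0, 1]$ into equal-width bins, and computes the $\chi^2$ statistic against the uniform expectation. The p-value lookup and the "spreading" of censored patients from [16] are deliberately omitted. The function names are hypothetical.

```python
def d_calibration_counts(survival_at_event, n_bins=10):
    """Count uncensored patients whose S(d_i | x_i) falls into each bin."""
    counts = [0] * n_bins
    for p in survival_at_event:
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def chi_square_statistic(counts):
    """Goodness-of-fit statistic against equal expected counts per bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# One patient per bin gives a perfectly uniform histogram.
uniform_probs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
uniform_stat = chi_square_statistic(d_calibration_counts(uniform_probs))

# All patients in the top bin is far from uniform.
skewed_stat = chi_square_statistic(d_calibration_counts([0.95] * 10))
```

A statistic near zero corresponds to a large p-value (the model looks D-calibrated), whereas a large statistic such as the skewed case corresponds to a small one. Turning the statistic into a p-value would use a $\chi^2$ distribution from a statistics library.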

B.3 L1-Margin Loss
L1-loss is another metric for comparing survival prediction models; it computes the difference between the observed event time and the predicted median survival time for uncensored instances. To incorporate censored instances into the L1-loss calculation, we used the L1-Margin variant of this loss as described in [16]. For an individual censored at time $c_i$, L1-Margin loss sets the event time for this patient as

$$BG(c_i) = c_i + \frac{\int_{c_i}^{\infty} S_{KM}(t)\,dt}{S_{KM}(c_i)} \tag{5}$$

which is a "Best-Guess" value based on the censored time $c_i$ and the Kaplan-Meier curve estimated from the training dataset, $S_{KM}(t)$. (Note this corresponds to the expected value, given the survival time is at least $c_i$; see [16, Theorem B.1].) However, this "Best-Guess" survival time can be more meaningful for some instances than others, depending on the censored time $c_i$. For example, we know effectively nothing about a patient $x_e$ censored at time $c_e = 0$, meaning we should have very little confidence that $BG(0)$ matches $x_e$'s true survival time. By contrast, imagine no patients lived more than 1000 days in our dataset. If patient $x_k$ was censored at time $c_k = 995$ (i.e., close to this largest known survival time), we are fairly confident that $BG(c_k)$ is close to the true survival time. Therefore, Haider et al. [16] set a confidence weight $\alpha_k = 1 - S_{KM}(c_k)$ for each "Best-Guess" estimation, which yields lower confidence for early censoring and higher confidence for late censoring.
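The following sketch puts the pieces together: a Kaplan-Meier estimate, the "Best-Guess" value, and the weighted loss. It rests on two assumptions not stated in the text. First, the integral of $S_{KM}$ is truncated at a largest time `t_max`. Second, the loss is normalized by the number of uncensored patients plus the sum of the confidence weights. All function names and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return the KM step points [(t, S(t))] at the distinct event times."""
    steps = []
    surv = 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - deaths / at_risk
        steps.append((t, surv))
    return steps


def km_value(steps, t):
    """Evaluate the right-continuous KM step function at time t."""
    value = 1.0
    for time, surv in steps:
        if time > t:
            break
        value = surv
    return value


def best_guess(steps, c, t_max):
    """BG(c) = c + (area under S_KM from c to t_max) / S_KM(c)."""
    s_c = km_value(steps, c)
    if s_c == 0.0:
        return c  # nobody survives past c under the KM curve
    area, left, current = 0.0, c, s_c
    for time, surv in steps:
        if time <= c:
            continue
        area += current * (time - left)
        left, current = time, surv
    area += current * max(t_max - left, 0.0)  # assumed truncation at t_max
    return c + area / s_c


def l1_margin_loss(times, events, predicted, steps, t_max):
    """Weighted mean absolute error with de-censored targets (assumed form)."""
    total, weight_sum = 0.0, 0.0
    for t, e, p in zip(times, events, predicted):
        if e:
            total += abs(t - p)
            weight_sum += 1.0
        else:
            alpha = 1.0 - km_value(steps, t)  # confidence weight
            total += alpha * abs(best_guess(steps, t, t_max) - p)
            weight_sum += alpha
    return total / weight_sum


# Toy data: the third patient is censored at time 3.
times, events = [1, 2, 3, 4], [1, 1, 0, 1]
km_steps = kaplan_meier(times, events)
bg_at_3 = best_guess(km_steps, 3, t_max=4)
loss = l1_margin_loss(times, events, [1, 2, 5, 4], km_steps, t_max=4)
```

In this toy example the censored patient is assigned a Best-Guess time of 4 with weight 0.5, so a predicted median of 5 contributes a weighted error of 0.5 to the loss.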

B.3.1 Marginal Concordance Index
All censored individuals can be included in the C-index calculation by using the aforementioned "Best-Guess"-value-based de-censoring approach (Equation 5). We define the marginal C-index (mC-index) as the C-index variant that includes all pairs of censored and uncensored (after de-censoring) individuals in its calculation. Moreover, each pair of individuals in the mC-index calculation is assigned a confidence weight: 1 for pairs that were comparable under the original definition, $1 - S_{KM}(c_i)$ for originally incomparable pairs with one censored individual, $i$, and $(1 - S_{KM}(c_i)) \cdot (1 - S_{KM}(c_j))$ for originally incomparable pairs with two censored individuals, $i$ and $j$. Note that all results reported in the main paper and the Appendix use the standard C-index, to be comparable with many other papers; the mC-index results are provided only in the supplementary material.

Figure 3 (resp., Figure 4, Figure 5) in the main text. For patient mortality as the event of interest in dataset D2 (i.e., D2[d]), all models were calibrated, and RSF had a marginally better C-index than the GBCM-KP model while having a worse L1-Margin loss; see Table C.4. Although MTLR exhibits the lowest L1-Margin loss among all the models, its C-index performance is not as good as that of RSF and GBCM-KP. We obtained substantially better results for all models when we considered death, rather than hospital discharge, as the event of interest in dataset D2; this is probably due to the relatively high censoring rate for patient mortality, as shown in Table 1 in the main text. For patient mortality as the event of interest in dataset D3 (i.e., D3[d]), including PD and GDP as meta information only marginally improved the results of all survival models, as shown in Table C.7. The results of the ISD models for mortality (Table C.7) are not impacted as significantly as the results for hospital discharge (Table C.3), due to the relatively high censoring.